Fault Tolerance via Replication in Coarse Grain Data-Flow

Authors

  • Anh Nguyen-Tuong
  • Andrew S. Grimshaw
  • John F. Karpovich
Abstract

Recent advances in network technology promise to make gigabit-per-second bandwidth between remote hosts a reality in the near future. This increase in bandwidth paves the way for increased exploitation of distributed computing resources. Coupled with advances in distributed memory parallel compiler technology, there is strong reason to believe that wide-area distributed parallel processing will be an increasingly popular and important programming paradigm. Parallelizing and distributing program sub-tasks has the potential to increase performance for many applications while also improving the overall utilization of system resources. Unfortunately, there is a downside. When a program is partitioned into sub-tasks, each sub-task may be distributed to a different processor. As the number of processors employed by an application increases, so does the chance that the application will fail due to a host or processor failure.

Similar resources

Fault Tolerance via Replication in Coarse Grain Data-Flow

Recent advances in network technology promise to make gigabit-per-second bandwidth between remote hosts a reality in the near future. This increase in bandwidth paves the way for increased exploitation of distributed computing resources. Coupled with advances in distributed memory parallel compiler technology, there is strong reason to believe that wide-area distributed parallel processing will...


AR-SMT: Coarse-Grain Time Redundancy for High Performance General Purpose Processors

Time redundancy is a fault tolerance technique in which a task (either computation or communication) is performed multiple times on the same hardware. This technique is cheaper than other fault tolerance solutions that require some form of hardware redundancy, because it does not require replicated hardware. However, fault coverage may be lower with time redundancy as it only captures certain c...


A Rollback-Recovery Protocol for Wide Area Pipelined Data Flow Computations

It is argued that there is a significant class of pipelined large grain data flow computations whose wide area distribution and long running nature suggest a need for fault-tolerance, but for which existing approaches appear either costly or incomplete. An example, which motivated this paper, is the execution of queries over distributed databases. This paper presents an approach which exploits ...


Fault Tolerant Wide-Area Parallel Computing

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We...


Applying Low-Overhead Rollback-Recovery to Wide Area Distributed Query Processing

It is argued that there is a significant class of pipelined large grain data flow computations whose wide area distribution and long running nature suggest a need for fault-tolerance, but for which existing approaches appear either costly or incomplete. This paper presents an approach which exploits limited input from the application layer to implement a low overhead recovery protocol for such ...


Publication date: 1995